Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WSLowering] consumer release on each thread instead of master thread #10

Merged
merged 2 commits into from
Jan 14, 2025

Conversation

htyu
Copy link
Contributor

@htyu htyu commented Dec 16, 2024

Having each thread run the consumer release operation simplifies the logics by avoiding master thread id computation. This seems to help improve performance a bit.

  • 2% speedup for triton_tutorial_flash_v2_tma_ws
  • 0.3% speedup for matmul_layernorm_ws

@htyu htyu requested review from bertmaher and manman-ren December 16, 2024 17:20
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 16, 2024
remoteCTAId, false, 0);

auto arriveOp = builder.create<ttng::MBarrierArriveOp>(
loc, bufferEmpty, nullptr, nullptr, false, 0);
assert(op.getOperation()->hasAttr("async_task_id"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't quite get this logic around remote-cta. It seems this change gets rid of the remote-cta mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this gets rid of remote-cta. Do you think it is still useful? I'm not sure how it would be used.

unsigned bufferEmptyCount = numCTAs;
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView, numCTAs);
builder.create<ttng::InitBarrierOp>(loc, barrierEmptyView,
THREADS_PER_TASK);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this changes from a barrier across CTAs to a barrier within the warp group of 128 threads?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes from expecting a master thread form a WG running the barrier arrival to all threads within the WG running it.

@manman-ren
Copy link
Contributor

Nice perf win!

@htyu htyu merged commit 0eba80a into ws Jan 14, 2025
2 of 6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants